class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Data Types, Import & Export
### Johannes Breuer & Stefan Jünger
### 2021-08-02

---
layout: true

---

## Getting data into `R`

So far, we have learned what `R` and `RStudio` are. This course is about starting to use `R` and feeling prepared to use it for statistical analyses. There's one essential prerequisite:

.center[**We need data!**]

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\import_data.png" width="50%" style="display: block; margin: auto;" />

---

## Content of this session

- What are `R`'s internal data types?
- How to work with different data types?
- How to import data in different formats?
- How to export data in different formats?

---

## Data we use in this course

During the course, we use several different datasets. Especially in this session, where we apply different import functions, we use a large variety of them, ranging from data about the Titanic to data about unicorns.

However, we will also use data that are (presumably) more interesting for social and behavioral scientists.

---

## It boils down to...

.pull-left[
**How your data are stored (data types)**

- 'Numbers' (Integers & Doubles)
- Character Strings
- Logical
- Factors
- ...
- There's more, e.g., expressions, but let's leave it at that
]

.pull-right[
**Where your data are stored (data formats)**

- Vectors
- Matrices
- Arrays
- Data frames / Tibbles
- Lists
]

.footnote[https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf]

---

## Numeric data

.small[
*Integers* are values without a fractional part. To use them explicitly in `R`, you have to place an `L` behind the actual value, like this:

```r
1L
```

```
## [1] 1
```

By contrast, *doubles* are values that can have a fractional part.

```r
1.1
```

```
## [1] 1.1
```

We can check data types by using the `typeof()` function.

```r
typeof(1L)
```

```
## [1] "integer"
```

```r
typeof(1.1)
```

```
## [1] "double"
```
]

---

## Character strings

At first glance, a *character* is a letter between a and z, and a *string* is a series of such characters. However, numbers and other symbols can also be part of a *character string*, which can then be, e.g., part of a text. In `R`, character strings are wrapped in quotation marks.

```r
"Hi. I am a character string, the 1st of its kind!"
```

```
## [1] "Hi. I am a character string, the 1st of its kind!"
```

Character strings carry no numeric meaning by themselves: there are no values associated with their content unless we change that, e.g., with factors.

---

## Factors

If you're a *Stata* (or *SPSS*) user, you may already be quite familiar with factors. Factors are data types that assume that their values are not continuous, e.g., as in ordinal or nominal data. They are useful when used in regression models, as we will see later in this course.

```r
factor(1.1)
```

```
## [1] 1.1
## Levels: 1.1
```

```r
factor("Hi. I am a character string, the 1st of its kind!")
```

```
## [1] Hi. I am a character string, the 1st of its kind!
## Levels: Hi. I am a character string, the 1st of its kind!
```

Factors accept numeric data or character strings as input and simply convert them into so-called levels. This concept may be a little bit abstract for the time being. For now, it's enough to have heard of them; you will learn more about them in the data wrangling session.

---

## Logical values

Logical values are basically either `TRUE` or `FALSE` values. These values are produced by running logical tests on your data.

```r
2 > 1
```

```
## [1] TRUE
```

```r
2 < 1
```

```
## [1] FALSE
```

I'd say that logical values are at the heart of creating loops. For this purpose, we need a few more logical operators that return `TRUE` or `FALSE` values.

---

## Logical operators

Here are all (?)
logical operators in `R`:

- `<` less than
- `<=` less than or equal to
- `>` greater than
- `>=` greater than or equal to
- `==` exactly equal to
- `!=` not equal to
- `!x` NOT x
- `x | y` x OR y
- `x & y` x AND y
- `isTRUE(x)` test if x is TRUE
- `isFALSE(x)` test if x is FALSE

.footnote[https://www.statmethods.net/management/operators.html]

Moreover, there are some more `is.PROPERTY_ASKED_FOR()` functions, such as `is.numeric()`, which also return `TRUE` or `FALSE` values.

---

## `R`'s data formats

`R`'s different data types can be put into 'containers'.

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\9213.1526125966.png" width="75%" style="display: block; margin: auto;" />

.footnote[https://devopedia.org/r-data-structures]

---

## Vectors

Vectors are built by enclosing your content with `c()` ("c" for "concatenate").

```r
numeric_vector   <- c(1, 2, 3, 4)
character_vector <- c("a", "b", "c", "d")

numeric_vector
```

```
## [1] 1 2 3 4
```

```r
character_vector
```

```
## [1] "a" "b" "c" "d"
```

Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors.

---

## ...but it matters when you combine vectors

Using the functions `cbind()` and `rbind()`, you can combine vectors column-wise or row-wise, respectively.

```r
cbind(numeric_vector, character_vector)
```

```
##      numeric_vector character_vector
## [1,] "1"            "a"
## [2,] "2"            "b"
## [3,] "3"            "c"
## [4,] "4"            "d"
```

```r
rbind(numeric_vector, character_vector)
```

```
##                  [,1] [,2] [,3] [,4]
## numeric_vector   "1"  "2"  "3"  "4"
## character_vector "a"  "b"  "c"  "d"
```

They are now matrices (note that the numeric values have been coerced into strings).

---

## Matrices

Matrices are the basic rectangular data format in `R`.

```r
fancy_matrix <- matrix(1:16, nrow = 4)

fancy_matrix
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
```

You cannot store multiple data types, such as strings and numeric values, in the same matrix. If you try, your data will get coerced to a common type, as seen on the previous slide.

- In fact, this already happens within vectors.

```r
c(1, 2, "evil string")
```

```
## [1] "1"           "2"           "evil string"
```

---

## Data frames

While matrices are used, e.g., for (\*drumroll\*) matrix operations, data frames more closely resemble the data formats most of you are probably already familiar with.

We can build data frames by hand, as shown here:

.tinyish[
```r
library(randomNames) # a name generator package

fancy_data <- data.frame(
  who = randomNames(n = 10, which.names = "first"),
  age = sample(14:49, 10, replace = TRUE), # you see what we are doing here?
  salary_2018 = sample(15:100, 10, replace = TRUE),
  salary_2019 = sample(15:100, 10, replace = TRUE)
)

fancy_data
```
]

.right[↪️]

---
class: middle

```
##          who age salary_2018 salary_2019
## 1     Joanne  19         100          91
## 2      Aakif  22          82          32
## 3       Ryan  39          37          64
## 4     Alyssa  39          50          40
## 5     Nadeem  49          73          30
## 6  Christine  32          77          91
## 7   Mushtaaq  17          94          99
## 8    Arafaat  49          91          37
## 9     Kianna  16          83          44
## 10     Anisa  49          68          27
```

---

## Tibbles

.pull-left[
Tibbles are basically just `R` `data.frame`s, but nicer.

- only the first ten observations are printed
- output is tidier!
- you get some additional metadata about rows and columns that you would normally only get when using `dim()` and other functions

You can check the [tibble vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) for technical details.
] .pull-right[ <img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\tibble.png" width="60%" style="display: block; margin: auto;" /> ] --- class: middle ```r library(tibble) as_tibble(fancy_data) ``` ``` ## # A tibble: 10 x 4 ## who age salary_2018 salary_2019 ## <chr> <int> <int> <int> ## 1 Joanne 19 100 91 ## 2 Aakif 22 82 32 ## 3 Ryan 39 37 64 ## 4 Alyssa 39 50 40 ## 5 Nadeem 49 73 30 ## 6 Christine 32 77 91 ## 7 Mushtaaq 17 94 99 ## 8 Arafaat 49 91 37 ## 9 Kianna 16 83 44 ## 10 Anisa 49 68 27 ``` --- ## One last type you should know: lists Lists are perfect for storing numerous and potentially diverse information in one place. ```r fancy_list <- list( numeric_vector, character_vector, fancy_matrix, fancy_data ) fancy_list ``` .right[↪️] --- class: middle .tinyish[ ``` ## [[1]] ## [1] 1 2 3 4 ## ## [[2]] ## [1] "a" "b" "c" "d" ## ## [[3]] ## [,1] [,2] [,3] [,4] ## [1,] 1 5 9 13 ## [2,] 2 6 10 14 ## [3,] 3 7 11 15 ## [4,] 4 8 12 16 ## ## [[4]] ## who age salary_2018 salary_2019 ## 1 Joanne 19 100 91 ## 2 Aakif 22 82 32 ## 3 Ryan 39 37 64 ## 4 Alyssa 39 50 40 ## 5 Nadeem 49 73 30 ## 6 Christine 32 77 91 ## 7 Mushtaaq 17 94 99 ## 8 Arafaat 49 91 37 ## 9 Kianna 16 83 44 ## 10 Anisa 49 68 27 ``` ] --- ## Nested lists ```r fancy_nested_list <- list( fancy_vectors = list(numeric_vector, character_vector), data_stuff = list(fancy_matrix, fancy_data) ) fancy_nested_list ``` .right[↪️] --- class: middle .tinyish[ ``` ## $fancy_vectors ## $fancy_vectors[[1]] ## [1] 1 2 3 4 ## ## $fancy_vectors[[2]] ## [1] "a" "b" "c" "d" ## ## ## $data_stuff ## $data_stuff[[1]] ## [,1] [,2] [,3] [,4] ## [1,] 1 5 9 13 ## [2,] 2 6 10 14 ## [3,] 3 7 11 15 ## [4,] 4 8 12 16 ## ## $data_stuff[[2]] ## who age salary_2018 salary_2019 ## 1 Joanne 19 100 91 ## 2 Aakif 22 82 32 ## 3 Ryan 39 37 64 ## 4 Alyssa 39 50 40 ## 5 Nadeem 49 73 30 ## 6 Christine 32 77 91 ## 7 Mushtaaq 17 94 99 ## 8 Arafaat 49 91 37 ## 9 Kianna 16 83 44 ## 10 Anisa 49 68 27 ``` 
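
Elements of a nested list can be reached by chaining `$` (for named elements) and double square brackets (for positions or names). The following is only a minimal sketch: `toy_nested_list` and its element names are hypothetical and merely mirror the structure printed above.

```r
# hypothetical nested list, for illustration only (not part of the course data)
toy_nested_list <- list(
  toy_vectors = list(c(1, 2, 3, 4), c("a", "b", "c", "d")),
  toy_matrices = list(matrix(1:4, nrow = 2))
)

# second vector inside the named element 'toy_vectors'
second_vector <- toy_nested_list$toy_vectors[[2]]

# same element selected by name, then a single value by position
third_letter <- toy_nested_list[["toy_vectors"]][[2]][3]

second_vector
third_letter
```

Selecting by name via a character string is handy when the name is only known at run time, e.g., inside a loop.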
]

---

## Accessing elements by index

Generally, `R` uses the `[index_number]` logic to access only a subset of the information in an object, no matter if we have vectors or data frames. Say we want to extract the second element of our `character_vector` object. We could do that like this:

```r
character_vector[2]
```

```
## [1] "b"
```

---

## More complicated cases: matrices

Matrices have two dimensions, and often you want information from a specific row and column.

```r
a_wonderful_matrix[number_of_row, number_of_column]
```

*Note*: You can do the same indexing with `data.frame`s.

---

## Matrices and subscripts (as in mathematical notation)

Identifying rows, columns, or elements using subscripts is similar to matrix notation:

```r
fancy_matrix[, 4]       # 4th column of matrix
fancy_matrix[3, ]       # 3rd row of matrix
fancy_matrix[2:4, 1:3]  # rows 2, 3, 4 of columns 1, 2, 3
```

It's really like in math, and you can perform standard mathematical operations, such as matrix multiplication.

```r
fancy_matrix[2:4, 1:3] %*% fancy_matrix[1:3, 2:4]
```

```
##      [,1] [,2] [,3]
## [1,]  116  188  260
## [2,]  134  218  302
## [3,]  152  248  344
```

---

## The case of data frames

A nice feature of `data.frame`s and `tibble`s is that their columns have names, just like variable names in ordinary data. It would be cumbersome to use index numbers to extract a specific column/variable, right? Do not fear:

```r
fancy_data$who
```

```
## [1] "Joanne"    "Aakif"     "Ryan"      "Alyssa"    "Nadeem"    "Christine" "Mushtaaq"  "Arafaat"
## [9] "Kianna"    "Anisa"
```

Just place a `$` sign between the data object and the variable name.

---

## `[]` in data frames

Sometimes we also have to rely on character strings as input information, e.g., for iterating over data. We can also use `[]` to access variables by name.

.pull-left[
Not only this way:

```r
fancy_data[1]
```

```
##          who
## 1     Joanne
## 2      Aakif
## 3       Ryan
## 4     Alyssa
## 5     Nadeem
## 6  Christine
## 7   Mushtaaq
## 8    Arafaat
## 9     Kianna
## 10     Anisa
```
]

.pull-right[
But also this way:

```r
fancy_data["who"]
```

```
##          who
## 1     Joanne
## 2      Aakif
## 3       Ryan
## 4     Alyssa
## 5     Nadeem
## 6  Christine
## 7   Mushtaaq
## 8    Arafaat
## 9     Kianna
## 10     Anisa
```
]

---

## Dataframe check 1, 2, 1, 2!

Even before checking the codebook for a dataset (if there is one), it always helps to have a quick look at the data. The most high-level information you can get is about the object type and its dimensions.

.small[
```r
# object type
class(fancy_data)
```

```
## [1] "data.frame"
```

```r
# number of rows and columns
dim(fancy_data)
```

```
## [1] 10  4
```

```r
# number of rows
nrow(fancy_data)
```

```
## [1] 10
```

```r
# number of columns
ncol(fancy_data)
```

```
## [1] 4
```
]

---

## Dataframe check 1, 2, 1, 2!

You can also print the first rows of the data frame with `head()`. By default, it prints the first 6 rows. You can easily change this by providing the number of rows as the second argument to the `head()` function.

```r
head(fancy_data, 3)
```

```
##      who age salary_2018 salary_2019
## 1 Joanne  19         100          91
## 2  Aakif  22          82          32
## 3   Ryan  39          37          64
```

---

## Dataframe check 1, 2, 1, 2!

If we want some more (detailed) information about the dataset or object, we can use the `base R` function `str()`.

```r
str(fancy_data)
```

.right[↪️]

---

.smaller[
```
## 'data.frame':    10 obs. of  4 variables:
##  $ who        : chr  "Joanne" "Aakif" "Ryan" "Alyssa" ...
##  $ age        : int  19 22 39 39 49 32 17 49 16 49
##  $ salary_2018: int  100 82 37 50 73 77 94 91 83 68
##  $ salary_2019: int  91 32 64 40 30 91 99 37 44 27
```
]

---

## Dataframe check 1, 2, 1, 2!

As you saw on the previous slide, the output of `str()` can be a bit hard to read (especially for larger datasets). A good alternative that creates a more clearly laid-out output is the `glimpse()` function from the `dplyr` package.

```r
library(dplyr) # provides glimpse()

glimpse(fancy_data)
```

.right[↪️]

---

.smaller[
```
## Rows: 10
## Columns: 4
## $ who         <chr> "Joanne", "Aakif", "Ryan", "Alyssa", "Nadeem", "Christine", "Mushtaaq", "Arafaat", "K~
## $ age         <int> 19, 22, 39, 39, 49, 32, 17, 49, 16, 49
## $ salary_2018 <int> 100, 82, 37, 50, 73, 77, 94, 91, 83, 68
## $ salary_2019 <int> 91, 32, 64, 40, 30, 91, 99, 37, 44, 27
```
]

---

## Dataframe check 1, 2, 1, 2!

If you want to have a look at your full dataset, you can use the `View()` function. In *RStudio*, this will open a new tab in the source pane through which you can explore the dataset (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view.

```r
View(fancy_data)
```

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\rstudio_view.png" width="65%" style="display: block; margin: auto;" />

---

## Difference between `[]` and `[[]]`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\indexing_lists.png" width="1832" style="display: block; margin: auto;" />

.footnote[https://twitter.com/hadleywickham/status/643381054758363136/photo/1]

---

## Retrieving and defining names

We can print all names of an object using the `names()` function...

```r
names(fancy_data)
```

```
## [1] "who"         "age"         "salary_2018" "salary_2019"
```

...and we can change its names with it. Note that the new vector has to contain one name per column.

```r
names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019")

names(fancy_data)
```

```
## [1] "name"        "age"         "salary_2018" "salary_2019"
```

However, there are more flexible ways of doing this, as we will see later.

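
When only one column should be renamed, assigning a complete vector of names is error-prone, because the vector must contain exactly one name per column. A minimal `base R` sketch that replaces a single name by matching on the current one (the data frame `toy_data` is hypothetical and only serves as an illustration):

```r
# hypothetical toy data frame, for illustration only
toy_data <- data.frame(
  who = c("Ada", "Ben"),
  age = c(31L, 45L),
  salary_2018 = c(50L, 62L)
)

# replace only the name that currently equals "who"
names(toy_data)[names(toy_data) == "who"] <- "name"

names(toy_data)
```

```
## [1] "name"        "age"         "salary_2018"
```

All other column names stay untouched, regardless of how many columns the data frame has.
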
---
class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_1_Data_Types.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_1_Data_Types.html)

---

## GESIS Panel Data on the Coronavirus Outbreak

.left-column[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\gesis_panel_logo_web.jpg" width="372" style="display: block; margin: auto;" />
]

.right-column[
For most of the examples and exercises in this course we will use the [Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany](https://www.gesis.org/gesis-panel/coronavirus-outbreak/public-use-file-puf).

You can [download the dataset in different formats as well as the codebook and the questionnaire (in German) from the *GESIS* Data Archive](https://search.gesis.org/research_data/ZA5667) (note: you need to have/create a user account).

The *GESIS Panel* website provides [detailed documentation](https://www.gesis.org/gesis-panel/documentation), including a [cheatsheet](https://www.gesis.org/fileadmin/upload/forschung/programme_projekte/Drittmittelprojekte/GESIS_Panel/gesis_panel_cheatsheet.pdf).
]

---

## Gapminder Data

.left-column[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\gapminder_logo.png" width="1200" style="display: block; margin: auto;" />
]

.right-column[
We will also use [data from *Gapminder*](https://www.gapminder.org/data/). During the course and the exercises, we work with data we have downloaded from their website. There also is an `R` package that bundles some of the *Gapminder* data: `install.packages("gapminder")`. This `R` package provides ["[a]n excerpt of the data available at Gapminder.org.
For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."](https://cran.r-project.org/web/packages/gapminder/index.html)
]

---

## How to use the data in general

To code along and be able to do the exercises, you should store the data files for the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* in a folder called `data` in the same folder as the other materials for this course.

The *Gapminder* data (as well as the Titanic and unicorn data) should already be in the `data` folder (if you downloaded/cloned the materials from the [*GitHub* repo for this course](https://github.com/jobreu/r-intro-gesis-2021)).

We also provide you with a synthetic data set based on the GESIS Panel data. This synthetic data set was created by [Bernd Weiß](https://berndweiss.net/) using the [`synthpop` package](https://www.synthpop.org.uk/).

---

## `R` is data-agnostic

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\Datenimport.PNG" width="65%" style="display: block; margin: auto;" />

---

## But the choice of packages is intimidating

.pull-left[
**What you will learn**

- Getting the most common data formats into `R`
  - e.g., CSV, *Stata*, *SPSS*, or *Excel* spreadsheets
- Using the most recent methods of doing that
  - We will rely a lot on packages and functions from the `tidyverse` instead of using `base R`
]

.pull-right[
**What you won't learn**

- Getting old & obscure binary data formats into `R`
- ... although [it's possible](https://cran.r-project.org/doc/manuals/r-release/R-data.html)
]

---

## Before writing any code: *RStudio* functionality to import data

`R` is no longer just for command line heroes. In the *RStudio* IDE menu, you can also select and load your data using the mouse. It's under `Environment - Import Dataset - Choose file type`.

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\rstudio_import.PNG" width="716" style="display: block; margin: auto;" />

---

## Where to find data

**Browse Button in `RStudio`**

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\importBrowse.PNG" width="75%" style="display: block; margin: auto;" />

**Code preview in `RStudio`**

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\codepreview.PNG" width="75%" style="display: block; margin: auto;" />

---

## Honestly, after some time you will write the code directly

.center[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\coding_cat.gif" style="display: block; margin: auto;" />
]

.footnote[[Source](https://media.giphy.com/media/LmNwrBhejkK9EFP504/source.gif)]

---

## Simple vs. not so simple file formats

Basic file formats, such as CSV (comma-separated values), can be imported into `R` directly

- they are 'flat'
- little metadata
- basically text files

Other file formats, particularly the proprietary ones, require the use of additional packages

- they are complex
- a lot of metadata (think of all the labels in an *SPSS* file)
- they are binary (1110101)

---

## File formats are a subject of war

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\norm_normal_file_format.png" width="30%" style="display: block; margin: auto;" />

https://xkcd.com/2116/

---

## Personal sidenote: why `tidyverse` for importing?

For simple files, `base R` provides proper tools for importing. Yet, for importing other files, we have to rely on additional packages anyway.

- the `tidyverse` packages (and its "friends") allow us to import and export all kinds of different data formats in a coherent way - the tidy data format also facilitates adding metadata to imported data - they are tibbles - a specific kind are labelled data (more on that in a bit) - the `tidyverse` provides some sane defaults, e.g., by automatic data type detection --- ## Disclaimer **In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. It's just for making a point of how agnostic `R` actually is regarding the file type. Later on, we will dive more into the specifics of importing.** --- ## For starters: Importing a CSV file using `Base R` ```r titanic <- read.csv("./data/titanic.csv") titanic ``` .tiny[ ``` ## PassengerId Survived Pclass Name Sex Age ## 1 1 0 3 Braund, Mr. Owen Harris male 22.00 ## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.00 ## 3 3 1 3 Heikkinen, Miss. Laina female 26.00 ## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00 ## 5 5 0 3 Allen, Mr. William Henry male 35.00 ## 6 6 0 3 Moran, Mr. James male NA ## 7 7 0 1 McCarthy, Mr. Timothy J male 54.00 ## 8 8 0 3 Palsson, Master. Gosta Leonard male 2.00 ## 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.00 ## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.00 ## 11 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 ## 12 12 1 1 Bonnell, Miss. Elizabeth female 58.00 ## 13 13 0 3 Saundercock, Mr. William Henry male 20.00 ## 14 14 0 3 Andersson, Mr. Anders Johan male 39.00 ## 15 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.00 ## 16 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.00 ## 17 17 0 3 Rice, Master. Eugene male 2.00 ## 18 18 1 2 Williams, Mr. Charles Eugene male NA ## 19 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) female 31.00 ## 20 20 1 3 Masselmani, Mrs. 
Fatima female NA ## 21 21 0 2 Fynney, Mr. Joseph J male 35.00 ## 22 22 1 2 Beesley, Mr. Lawrence male 34.00 ## 23 23 1 3 McGowan, Miss. Anna "Annie" female 15.00 ## 24 24 1 1 Sloper, Mr. William Thompson male 28.00 ## 25 25 0 3 Palsson, Miss. Torborg Danira female 8.00 ## 26 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) female 38.00 ## 27 27 0 3 Emir, Mr. Farred Chehab male NA ## 28 28 0 1 Fortune, Mr. Charles Alexander male 19.00 ## 29 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NA ## 30 30 0 3 Todoroff, Mr. Lalio male NA ## 31 31 0 1 Uruchurtu, Don. Manuel E male 40.00 ## 32 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NA ## 33 33 1 3 Glynn, Miss. Mary Agatha female NA ## 34 34 0 2 Wheadon, Mr. Edward H male 66.00 ## 35 35 0 1 Meyer, Mr. Edgar Joseph male 28.00 ## 36 36 0 1 Holverson, Mr. Alexander Oskar male 42.00 ## 37 37 1 3 Mamee, Mr. Hanna male NA ## 38 38 0 3 Cann, Mr. Ernest Charles male 21.00 ## 39 39 0 3 Vander Planke, Miss. Augusta Maria female 18.00 ## 40 40 1 3 Nicola-Yarred, Miss. Jamila female 14.00 ## 41 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.00 ## 42 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott) female 27.00 ## 43 43 0 3 Kraeff, Mr. Theodor male NA ## 44 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 ## 45 45 1 3 Devaney, Miss. Margaret Delia female 19.00 ## 46 46 0 3 Rogers, Mr. William John male NA ## 47 47 0 3 Lennon, Mr. Denis male NA ## 48 48 1 3 O'Driscoll, Miss. Bridget female NA ## 49 49 0 3 Samaan, Mr. Youssef male NA ## 50 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.00 ## 51 51 0 3 Panula, Master. Juha Niilo male 7.00 ## 52 52 0 3 Nosworthy, Mr. Richard Cater male 21.00 ## 53 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.00 ## 54 54 1 2 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson) female 29.00 ## 55 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.00 ## 56 56 1 1 Woolner, Mr. Hugh male NA ## 57 57 1 2 Rugg, Miss. 
Emily female 21.00 ## 58 58 0 3 Novel, Mr. Mansouer male 28.50 ## 59 59 1 2 West, Miss. Constance Mirium female 5.00 ## 60 60 0 3 Goodwin, Master. William Frederick male 11.00 ## 61 61 0 3 Sirayanian, Mr. Orsen male 22.00 ## 62 62 1 1 Icard, Miss. Amelie female 38.00 ## 63 63 0 1 Harris, Mr. Henry Birkhardt male 45.00 ## 64 64 0 3 Skoog, Master. Harald male 4.00 ## 65 65 0 1 Stewart, Mr. Albert A male NA ## 66 66 1 3 Moubarek, Master. Gerios male NA ## 67 67 1 2 Nye, Mrs. (Elizabeth Ramell) female 29.00 ## 68 68 0 3 Crease, Mr. Ernest James male 19.00 ## 69 69 1 3 Andersson, Miss. Erna Alexandra female 17.00 ## 70 70 0 3 Kink, Mr. Vincenz male 26.00 ## 71 71 0 2 Jenkin, Mr. Stephen Curnow male 32.00 ## 72 72 0 3 Goodwin, Miss. Lillian Amy female 16.00 ## 73 73 0 2 Hood, Mr. Ambrose Jr male 21.00 ## 74 74 0 3 Chronopoulos, Mr. Apostolos male 26.00 ## 75 75 1 3 Bing, Mr. Lee male 32.00 ## 76 76 0 3 Moen, Mr. Sigurd Hansen male 25.00 ## 77 77 0 3 Staneff, Mr. Ivan male NA ## 78 78 0 3 Moutal, Mr. Rahamin Haim male NA ## 79 79 1 2 Caldwell, Master. Alden Gates male 0.83 ## 80 80 1 3 Dowdell, Miss. Elizabeth female 30.00 ## 81 81 0 3 Waelens, Mr. Achille male 22.00 ## 82 82 1 3 Sheerlinck, Mr. Jan Baptist male 29.00 ## 83 83 1 3 McDermott, Miss. Brigdet Delia female NA ## SibSp Parch Ticket Fare Cabin Embarked ## 1 1 0 A/5 21171 7.2500 S ## 2 1 0 PC 17599 71.2833 C85 C ## 3 0 0 STON/O2. 3101282 7.9250 S ## 4 1 0 113803 53.1000 C123 S ## 5 0 0 373450 8.0500 S ## 6 0 0 330877 8.4583 Q ## 7 0 0 17463 51.8625 E46 S ## 8 3 1 349909 21.0750 S ## 9 0 2 347742 11.1333 S ## 10 1 0 237736 30.0708 C ## 11 1 1 PP 9549 16.7000 G6 S ## 12 0 0 113783 26.5500 C103 S ## 13 0 0 A/5. 
2151 8.0500 S ## 14 1 5 347082 31.2750 S ## 15 0 0 350406 7.8542 S ## 16 0 0 248706 16.0000 S ## 17 4 1 382652 29.1250 Q ## 18 0 0 244373 13.0000 S ## 19 1 0 345763 18.0000 S ## 20 0 0 2649 7.2250 C ## 21 0 0 239865 26.0000 S ## 22 0 0 248698 13.0000 D56 S ## 23 0 0 330923 8.0292 Q ## 24 0 0 113788 35.5000 A6 S ## 25 3 1 349909 21.0750 S ## 26 1 5 347077 31.3875 S ## 27 0 0 2631 7.2250 C ## 28 3 2 19950 263.0000 C23 C25 C27 S ## 29 0 0 330959 7.8792 Q ## 30 0 0 349216 7.8958 S ## 31 0 0 PC 17601 27.7208 C ## 32 1 0 PC 17569 146.5208 B78 C ## 33 0 0 335677 7.7500 Q ## 34 0 0 C.A. 24579 10.5000 S ## 35 1 0 PC 17604 82.1708 C ## 36 1 0 113789 52.0000 S ## 37 0 0 2677 7.2292 C ## 38 0 0 A./5. 2152 8.0500 S ## 39 2 0 345764 18.0000 S ## 40 1 0 2651 11.2417 C ## 41 1 0 7546 9.4750 S ## 42 1 0 11668 21.0000 S ## 43 0 0 349253 7.8958 C ## 44 1 2 SC/Paris 2123 41.5792 C ## 45 0 0 330958 7.8792 Q ## 46 0 0 S.C./A.4. 23567 8.0500 S ## 47 1 0 370371 15.5000 Q ## 48 0 0 14311 7.7500 Q ## 49 2 0 2662 21.6792 C ## 50 1 0 349237 17.8000 S ## 51 4 1 3101295 39.6875 S ## 52 0 0 A/4. 39886 7.8000 S ## 53 1 0 PC 17572 76.7292 D33 C ## 54 1 0 2926 26.0000 S ## 55 0 1 113509 61.9792 B30 C ## 56 0 0 19947 35.5000 C52 S ## 57 0 0 C.A. 31026 10.5000 S ## 58 0 0 2697 7.2292 C ## 59 1 2 C.A. 34651 27.7500 S ## 60 5 2 CA 2144 46.9000 S ## 61 0 0 2669 7.2292 C ## 62 0 0 113572 80.0000 B28 ## 63 1 0 36973 83.4750 C83 S ## 64 3 2 347088 27.9000 S ## 65 0 0 PC 17605 27.7208 C ## 66 1 1 2661 15.2458 C ## 67 0 0 C.A. 29395 10.5000 F33 S ## 68 0 0 S.P. 3464 8.1583 S ## 69 4 2 3101281 7.9250 S ## 70 2 0 315151 8.6625 S ## 71 0 0 C.A. 33111 10.5000 S ## 72 5 2 CA 2144 46.9000 S ## 73 0 0 S.O.C. 
14879 73.5000 S ## 74 1 0 2680 14.4542 C ## 75 0 0 1601 56.4958 S ## 76 0 0 348123 7.6500 F G73 S ## 77 0 0 349208 7.8958 S ## 78 0 0 374746 8.0500 S ## 79 0 2 248738 29.0000 S ## 80 0 0 364516 12.4750 S ## 81 0 0 345767 9.0000 S ## 82 0 0 345779 9.5000 S ## 83 0 0 330932 7.7875 Q ## [ reached 'max' / getOption("max.print") -- omitted 808 rows ] ``` ] --- ## A `tidyverse` / `readr` example ```r library(readr) titanic <- read_csv("./data/titanic.csv") ``` Please note the column specifications. `readr` 'guesses' them based on the first 1000 observations (we will come back to this later). --- class: middle .tinyish[ ```r titanic ``` ``` ## # A tibble: 891 x 12 ## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> ## 1 1 0 3 Braund, Mr. Owen Har~ male 22 1 0 A/5 21171 7.25 <NA> S ## 2 2 1 1 Cumings, Mrs. John B~ fema~ 38 1 0 PC 17599 71.3 C85 C ## 3 3 1 3 Heikkinen, Miss. Lai~ fema~ 26 0 0 STON/O2.~ 7.92 <NA> S ## 4 4 1 1 Futrelle, Mrs. Jacqu~ fema~ 35 1 0 113803 53.1 C123 S ## 5 5 0 3 Allen, Mr. William H~ male 35 0 0 373450 8.05 <NA> S ## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q ## 7 7 0 1 McCarthy, Mr. Timoth~ male 54 0 0 17463 51.9 E46 S ## 8 8 0 3 Palsson, Master. Gos~ male 2 3 1 349909 21.1 <NA> S ## 9 9 1 3 Johnson, Mrs. Oscar ~ fema~ 27 0 2 347742 11.1 <NA> S ## 10 10 1 2 Nasser, Mrs. Nichola~ fema~ 14 1 0 237736 30.1 <NA> C ## # ... with 881 more rows ``` ] It's that easy! --- ## A `readxl` example: `read_excel()` ```r library(readxl) unicorns <- read_xlsx("./data/observations.xlsx") ``` No output ☹️ --- class: middle ```r unicorns ``` ``` ## # A tibble: 42 x 3 ## countryname year pop ## <chr> <dbl> <dbl> ## 1 Austria 1670 85 ## 2 Austria 1671 83 ## 3 Austria 1674 75 ## 4 Austria 1675 82 ## 5 Austria 1676 79 ## 6 Austria 1677 70 ## 7 Austria 1678 81 ## 8 Austria 1680 80 ## 9 France 1673 70 ## 10 France 1674 79 ## # ... 
with 32 more rows ``` --- ## A `haven` example: `read_stata()` ```r library(haven) gp_covid <- read_stata("./data/ZA5667_v1-1-0_Stata14.dta") ``` --- ```r gp_covid ``` ``` ## za_number version doi id cohort sex age_cat education_cat intention_to_vote ## 1 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 1 3 1 7 3 2 ## 2 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 2 1 2 7 2 2 ## 3 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 3 3 1 8 2 2 ## 4 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 4 2 1 4 3 2 ## 5 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 5 1 2 1 3 -33 ## 6 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 6 1 1 10 2 2 ## 7 ZA5667 v1-1-0 2020-04-27 10.4232/1.13520 7 1 2 4 2 2 ## choice_of_party political_orientation marstat household hzcy001a hzcy002a hzcy003a hzcy004a hzcy005a ## 1 1 6 2 1 -33 -33 -33 -33 -33 ## 2 5 5 1 2 5 5 2 5 5 ## 3 1 5 1 2 5 6 3 6 6 ## 4 1 7 1 3 4 4 2 4 3 ## 5 -33 4 2 3 -33 -33 -33 -33 -33 ## 6 6 10 1 2 3 3 -99 3 3 ## 7 6 5 1 3 4 3 3 3 4 ## hzcy006a hzcy007a hzcy008a hzcy009a hzcy010a hzcy011a hzcy012a hzcy013a hzcy014a hzcy015a hzcy016a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 1 0 0 0 0 1 1 1 0 0 0 ## 3 1 1 0 0 0 1 0 0 1 0 0 ## 4 0 0 1 0 0 1 1 0 1 0 0 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 1 0 0 0 0 0 1 0 0 0 0 ## 7 1 1 0 0 0 1 0 0 1 0 0 ## hzcy018a hzcy019a hzcy020a hzcy021a hzcy022a hzcy023a hzcy024a hzcy025a hzcy026a hzcy027a hzcy028a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 0 5 5 5 5 5 5 5 4 -88 -88 ## 3 0 5 5 5 4 5 5 4 1 5 1 ## 4 0 3 4 4 3 4 2 2 1 4 2 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 0 4 4 4 4 4 2 3 1 5 2 ## 7 0 5 5 5 5 5 5 3 1 5 3 ## hzcy029a hzcy030a hzcy031a hzcy032a hzcy033a hzcy034a hzcy035a hzcy036a hzcy037a hzcy038a hzcy039a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 ## 3 4 5 5 5 -88 -88 -88 -88 -88 -88 -88 ## 4 3 3 3 3 -88 -88 -88 -88 -88 -88 -88 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 4 4 4 4 -88 -88 -88 -88 -88 -88 -88 ## 
7 5 5 5 5 -88 -88 -88 -88 -88 -88 -88 ## hzcy040a hzcy041a hzcy042a hzcy043a hzcy044a hzcy045a hzcy046a hzcy047a hzcy048a hzcy049a hzcy050a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 3 3 3 3 5 4 4 5 4 4 4 ## 3 2 3 3 3 4 4 5 5 4 3 4 ## 4 3 4 3 3 4 4 4 4 4 4 4 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 2 3 3 2 4 5 4 5 4 2 4 ## 7 3 3 3 3 4 4 4 4 3 3 3 ## hzcy051a hzcy052a hzcy053a hzcy054a hzcy055a hzcy056a hzcy057a hzcy058a hzcy059a hzcy060a hzcy061a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 4 4 1 0 1 0 0 0 0 0 -88 ## 3 2 5 5 -88 -88 -88 -88 -88 -88 -88 -88 ## 4 4 4 1 0 0 0 0 0 0 1 -88 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 3 5 5 -88 -88 -88 -88 -88 -88 -88 -88 ## 7 2 4 1 0 0 0 0 0 0 1 -88 ## hzcy062a hzcy063a hzcy064a hzcy065a hzcy066a hzcy067a hzcy068a hzcy069a hzcy070a hzcy071a hzcy072a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 -88 -88 -88 -88 -88 -88 -88 -88 -88 2 -88 ## 3 -88 -88 -88 -88 -88 -88 -88 -88 -88 2 -88 ## 4 -88 -88 -88 -88 -88 -88 -88 -88 -88 1 1 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 -88 -88 -88 -88 -88 -88 -88 -88 -88 2 -88 ## 7 -88 -88 -88 -88 -88 -88 -88 -88 -88 1 1 ## hzcy073a hzcy074a hzcy075a hzcy076a hzcy077a hzcy078a hzcy079a hzcy080a hzcy081a hzcy083a hzcy084a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 1 ## 3 -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 1 ## 4 0 0 0 0 0 0 0 0 0 0 1 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 -88 -88 -88 -88 -88 -88 -88 -88 -88 -88 1 ## 7 1 0 0 0 0 0 0 1 0 0 0 ## hzcy085a hzcy086a hzcy087a hzcy088a hzcy089a hzcy090a hzcy091a hzcy092a hzcy093a hzcy095a hzcy096a ## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 2 1 0 0 0 1 1 0 0 1 0 4 ## 3 1 0 1 0 1 0 0 1 0 0 -88 ## 4 0 0 0 0 0 0 0 0 0 0 -88 ## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 ## 6 0 0 0 0 1 0 0 0 0 0 -88 ## 7 1 0 0 0 0 0 0 0 0 0 -88 ## hzcy097a hzcy098a hzcy099a hzza001a hzza002a hzza003a hzzq009a hzzq016b hzzq023a 
hzzp201a hzzp204a
## 1 -33 -33 -33 1 0 0 -33 -33 -33 -33 -33
## 2 0 0 1 1 1 1 4 0 5 31 210
## 3 -88 -88 -88 1 1 1 5 0 5 31 377
## 4 -88 -88 -88 1 1 1 4 0 4 31 309
## 5 -33 -33 -33 1 0 0 -33 -33 -33 -33 -33
## 6 -88 -88 -88 1 1 1 4 0 4 31 429
## 7 -88 -88 -88 1 1 1 3 0 4 31 586
## hzzp207a hzzr001a hzzr002a hzzr003a hzzr004a hzzr005a hzzr006a hzzr007a hzzr008a hzzr009a hzzr010a
## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33
## 2 1584549879 3 24 48 71 82 0 0 101 130 145
## 3 1584469614 34 83 117 161 175 206 0 245 293 0
## 4 1584525461 4 35 67 101 110 140 0 166 190 216
## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33 -33
## 6 1584461540 3 41 90 143 159 209 0 288 340 0
## 7 1584823080 7 67 121 212 230 264 0 319 388 438
## hzzr011a hzzr012a hzzr013a hzzr014a hzzr015a hzzr016a hzzr017a hzzr018a hzzr019a
## 1 -33 -33 -33 -33 -33 -33 -33 -33 -33
## 2 150 0 189 193 200 209 210 138 0
## 3 312 0 345 0 0 360 377 307 0
## 4 222 246 266 0 0 307 309 206 0
## 5 -33 -33 -33 -33 -33 -33 -33 -33 -33
## 6 366 0 412 0 0 424 429 360 0
## 7 446 509 558 0 0 576 586 416 0
## [ reached 'max' / getOption("max.print") -- omitted 3758 rows ]
```

---

## `read_stata()`'s sister: `read_spss()`

Indeed, there's also the function `read_spss()` to import *SPSS* files. It also provides capabilities to handle *SPSS*-defined missing values by setting the option `user_na = TRUE` (default is `FALSE`).

*Note*: The [`sjlabelled` package](https://cran.r-project.org/web/packages/sjlabelled/index.html) can also be used if you need a more elaborate approach to missing values: https://cran.r-project.org/web/packages/sjlabelled/vignettes/intro_sjlabelled.html

**We will come back to Stata and SPSS files since they represent a specific kind of data in `R`: labelled data.**

---

## There's more

These were just some first examples of applying functions from each package. The packages comprise even more functions for different file formats.
- `readr`
  - `read_csv()`
  - `read_tsv()`
  - `read_delim()`
  - `read_fwf()`
  - `read_table()`
  - `read_log()`
- `haven`
  - `read_sas()`
  - `read_spss()`
  - `read_stata()`

Not to mention all the helper functions and options. For example, we can define the cells to read from an *Excel* file by specifying the option `range = "C1:E4"` in `read_excel()` (from the `readxl` package).

---

## Data type specifications for `tibbles`

- characters
  - indicated by `<chr>`
  - specified by `col_character()`
- integers
  - indicated by `<int>`
  - specified by `col_integer()`
- doubles
  - indicated by `<dbl>`
  - specified by `col_double()`
- factors
  - indicated by `<fct>`
  - specified by `col_factor()`
- logical
  - indicated by `<lgl>`
  - specified by `col_logical()`

.center[**There's more, but we'll leave it at that for now.**]

---

## Changing variable types, e.g., in CSV files

As mentioned before, `read_csv()` 'guesses' the variable types by scanning the first 1000 observations. **NB**: This can go wrong!

Luckily, we can change the variable type...

- before/while loading the data
- and after loading the data

---

## While loading the data in `read_csv()`

.tinyish[
```r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_character(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

```
## # A tibble: 891 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Braund, Mr. Owen Har~ male 22 1 0 A/5 21171 7.25 <NA> S
## 2 2 1 1 Cumings, Mrs. John B~ fema~ 38 1 0 PC 17599 71.3 C85 C
## 3 3 1 3 Heikkinen, Miss. Lai~ fema~ 26 0 0 STON/O2.~ 7.92 <NA> S
## 4 4 1 1 Futrelle, Mrs. Jacqu~ fema~ 35 1 0 113803 53.1 C123 S
## 5 5 0 3 Allen, Mr. William H~ male 35 0 0 373450 8.05 <NA> S
## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q
## 7 7 0 1 McCarthy, Mr. Timoth~ male 54 0 0 17463 51.9 E46 S
## 8 8 0 3 Palsson, Master. Gos~ male 2 3 1 349909 21.1 <NA> S
## 9 9 1 3 Johnson, Mrs. Oscar ~ fema~ 27 0 2 347742 11.1 <NA> S
## 10 10 1 2 Nasser, Mrs. Nichola~ fema~ 14 1 0 237736 30.1 <NA> C
## # ... with 881 more rows
```
]

---

## While loading the data in `read_csv()`

.tinyish[
```r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(), # This one changed!
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

```
## # A tibble: 891 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <dbl> <dbl> <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Braund, Mr. Owen Har~ male 22 1 0 A/5 21171 7.25 <NA> S
## 2 2 1 1 Cumings, Mrs. John B~ fema~ 38 1 0 PC 17599 71.3 C85 C
## 3 3 1 3 Heikkinen, Miss. Lai~ fema~ 26 0 0 STON/O2.~ 7.92 <NA> S
## 4 4 1 1 Futrelle, Mrs. Jacqu~ fema~ 35 1 0 113803 53.1 C123 S
## 5 5 0 3 Allen, Mr. William H~ male 35 0 0 373450 8.05 <NA> S
## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q
## 7 7 0 1 McCarthy, Mr. Timoth~ male 54 0 0 17463 51.9 E46 S
## 8 8 0 3 Palsson, Master. Gos~ male 2 3 1 349909 21.1 <NA> S
## 9 9 1 3 Johnson, Mrs. Oscar ~ fema~ 27 0 2 347742 11.1 <NA> S
## 10 10 1 2 Nasser, Mrs. Nichola~ fema~ 14 1 0 237736 30.1 <NA> C
## # ... with 881 more rows
```
]

---

## After loading the data

```r
# Note: type_convert() only re-parses character columns
titanic <-
  readr::type_convert(
    titanic,
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )
```

---

## Beyond flat files: labelled data

Much of the data you get, find, or even collect comes in some sort of flat file format, such as CSV. In the social sciences, however, we often deal with proprietary file formats, such as *SPSS*'s `.sav` or *Stata*'s `.dta` files.

What we often find in these data are labels. These labels are used to describe variables or variable values. They comprise some specific metadata inherent in these proprietary file formats.

*If you were able to travel ten years back in time and ask an `R` geek, she'd say that you cannot use labels in `R`. You'd either have to import, e.g., value labels as character strings or use their codes as factors. However, these days...*

---

## Not being able to use labelled data is a thing of the past

Nowadays, if you use the `haven` package, labels are built-in.
For example:

```r
gp_covid["age_cat"]
```

```
## age_cat
## 1 7
## 2 7
## 3 8
## 4 4
## 5 1
## 6 10
## 7 4
## 8 7
## 9 8
## 10 1
## ... (rows 11 to 1000 are not reproduced here)
## [ reached 'max' / getOption("max.print") -- omitted 2765 rows ]
```

---

## Advantages of using labelled data

One could rejoice in not having to use a codebook any more, just like in *SPSS*. And I think by and large this is true, although glimpsing at data by looking at code output is somewhat... geeky.

A definite advantage is that you can reuse the labels in figures and plots. Some packages, such as [`sjPlot`](https://strengejacke.github.io/sjPlot/), do that automatically.

Yet, primarily when you exchange your data with colleagues who do not use `R` or when you plan to publish your data (which you always should if that is possible), being able to export data you have manipulated in `R` is great.

- ... and, yes, you can do that with labelled data as well.
**However, be aware of the missing-values hell that you may have to enter due to the different definitions of missing values in *Stata* and *SPSS*.**

---

## Manipulating labels

I used to be a data ingest and preparation guy at the *GESIS* Data Archive. For this job, I had to use *SPSS* or *Stata*, although 'privately' I worked with `R` all the time for my dissertation.

I would have let Elon Musk name my firstborn child<sup>1</sup> if I had been able to perform all these tasks in `R`. Luckily, the generation after me can now at least start to use `R` for labelling or relabelling data with additional packages. One of those is the `sjlabelled` package by Daniel Lüdecke.

.footnote[
[1] The name of his child is X Æ A-12
]

---

## Getting labels

### Variables

```r
sjlabelled::get_label(gp_covid$age_cat)
```

```
## [1] "Alter, kategorisiert"
```

### Values

.tinyish[
```r
sjlabelled::get_labels(gp_covid$age_cat)
```

```
## [1] "<=25 Jahre" "26 bis 30 Jahre" "31 bis 35 Jahre" "36 bis 40 Jahre" "41 bis 45 Jahre"
## [6] "46 bis 50 Jahre" "51 bis 60 Jahre" "61 bis 65 Jahre" "66 bis 70 Jahre" ">=71 Jahre"
```
]

---

## And setting labels: Variables

```r
gp_covid$age_cat <-
  sjlabelled::set_label(
    gp_covid$age_cat,
    label = "Age, categorized"
  )

sjlabelled::get_label(gp_covid$age_cat)
```

```
## [1] "Age, categorized"
```

---

## And setting labels: Values

.tinyish[
```r
gp_covid$age_cat <-
  sjlabelled::set_labels(
    gp_covid$age_cat,
    labels = c(
      "<=25 years",
      "26 to 30 years",
      "31 to 35 years",
      "36 to 40 years",
      "41 to 45 years",
      "46 to 50 years",
      "51 to 60 years",
      "61 to 65 years",
      "66 to 70 years",
      ">=71 years"
    )
  )

sjlabelled::get_labels(gp_covid$age_cat)
```

```
## [1] "<=25 years" "26 to 30 years" "31 to 35 years" "36 to 40 years" "41 to 45 years" "46 to 50 years"
## [7] "51 to 60 years" "61 to 65 years" "66 to 70 years" ">=71 years"
```
]

---

## That's a lot of manual work

Yeah, this requires some tedious manual work that has to be done, at least by somebody.
But that's just how it is, even in *SPSS* or *Stata*. Indeed, we may have to wait until labelling in `R` scales a bit better.

Integrating basic labelling of variables in a pipe workflow, however, is already straightforward:

```r
gp_data_subset <-
  gp_covid %>%
  dplyr::select(age_cat, sex) %>%
  sjlabelled::var_labels(
    age_cat = "Age in Categories",
    sex = "Gender in Binary Form"
  )

sjlabelled::get_label(gp_data_subset)
```

```
## age_cat sex
## "Age in Categories" "Gender in Binary Form"
```

Ok, but that's already data wrangling, a topic for this afternoon.

---
class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_2_Flat_Files.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_2_Flat_Files.html)

---

## Exporting data

Sometimes our data have to leave `R`, for example, if we...

- share data with colleagues who do not use `R`
- want to continue where we left off
  - particularly if data wrangling took a long time

For such purposes, we also need a way to export our data. All of the packages we have discussed in this session also have designated functions for that.

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\export_data.png" width="50%" style="display: block; margin: auto;" />

---

## Examples: CSV and Stata files

```r
write_csv(titanic, "titanic_own.csv")
```

```r
write_dta(titanic, "titanic_own.dta")
```

Proof that they have been exported:

```r
list.files()
```

```
## [1] "Boxlots.png" "content" "data"
## [4] "exercises" "my_scripts" "r-intro-gesis-2021.Rproj"
## [7] "README.md" "slides" "solutions"
## [10] "titanic_own.csv" "titanic_own.dta" "to_delete"
```

---

## `R`'s native file formats

If you plan to continue to work with `R` (something we would always recommend 😜), there are at least two native 'file formats' to choose from.
The advantage of using them is that they are compressed files, so they don't occupy unnecessarily large disk space. Moreover, the data are stored exactly as you left them, and the files take less time to load (not a big deal in a small data world but relevant for big(ger) data).

Saving and loading `.RData`/`.rda` files:

```r
save(mydata, file = "mydata.RData")
load("mydata.RData")
```

Saving and loading `.rds` files:

```r
saveRDS(mydata, "mydata.rds")
mydata <- readRDS("mydata.rds")
```

`saveRDS()` saves a representation of a single object without its name, which means you can name it whatever you want when loading.

---

## Saving just everything

If you have not changed the General Global Options in *RStudio* as suggested in the *Getting Started* session, you may have noticed that, when closing *RStudio*, the program asks you by default whether you want to save the workspace image.

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\save_image.png" width="50%" style="display: block; margin: auto;" />

You can also do that whenever you want using the `save.image()` function:

```r
save.image(file = "my_fancy_workspace.RData")
```

---

## Additional packages

The great benefit of `tidyverse` import functions is the import of the data as tibbles: the data are potentially tidier.
Several other non-tidyverse packages provide similar benefits as they make use of this universal data format:

- [`sf`](https://github.com/r-spatial/sf) for geospatial data
- [`sjlabelled`](https://cran.r-project.org/web/packages/sjlabelled/index.html) to work with labelled data, e.g., from *SPSS* or *Stata*

---

## Other packages for data import

- `base` R
- the [`foreign` package](https://cran.r-project.org/web/packages/foreign/index.html) for *SPSS* and *Stata* files
- [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) or [`fst`](https://www.fstpackage.org/) for large datasets
- [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) for `.json` files
- [`datapasta`](https://github.com/MilesMcBain/datapasta) for copying and pasting data into tribbles (e.g., from websites, *Excel* or *Word* files)

---

## Final note on file paths

There is a simple rule: never use absolute file paths if you want to keep your code reproducible and future-proof. We already talked about this in the introduction, so this is just a reminder, as it is particularly important for data importing and exporting.

```r
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")

# Mac
load("/Users/cool_user/data/fancy_data.Rdata")

# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
```

---

## Use relative paths

Instead of using absolute paths, it is recommended to use relative file paths. The general principle here is to start from the directory in which your current script is located (more precisely, `R`'s current working directory) and navigate to your target location. Say we are in the "C:/Users/cool_user/" location on a Windows machine.
To load your data, we would use:

```r
load("./data/fancy_data.Rdata")
```

If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mauzi/", we would use:

```r
load("../../data/fancy_data.Rdata")
```

---
class: center, middle

Please first download the [Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany](https://search.gesis.org/research_data/ZA5667) as `.sav`, `.dta`, and `.csv` files.

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_3_Statistical_Software_Files.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_3_Statistical_Software_Files.html)
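
---

## Relative paths and `.rds` files: a small sketch

As a recap of the slides on relative paths and `R`'s native file formats, here is a minimal sketch that exports a toy data frame to an `.rds` file via a relative path and reads it back in. Everything in it is made up for illustration: a temporary folder stands in for your project folder, and `cool_project`, `fancy_data`, and the file name are hypothetical. Only base `R` functions are used.

```r
# A temporary folder stands in for the project root (hypothetical name)
project_dir <- file.path(tempdir(), "cool_project")
dir.create(
  file.path(project_dir, "data"),
  recursive = TRUE,
  showWarnings = FALSE
)

# Pretend the project folder is our working directory;
# setwd() returns the previous working directory, which we keep for later
old_wd <- setwd(project_dir)

# Some toy data to export
fancy_data <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Build the relative path piece by piece instead of typing the separators
rel_path <- file.path(".", "data", "fancy_data.rds")

# Export and re-import via the relative path
saveRDS(fancy_data, rel_path)
fancy_data_again <- readRDS(rel_path)

# Check whether the re-imported object matches the original
roundtrip_ok <- identical(fancy_data, fancy_data_again)

# Go back to where we started
setwd(old_wd)
```

Because the path is relative, the same lines should work on any machine as long as the working directory is the project folder, which is what an *RStudio* project usually takes care of.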